Introduction to Open Data Science - Course Project

About the project

This is the project page for the Introduction to Open Data Science course of fall 2022. A link to the repository for this project page is here: IODS-project (github.com/teemuvh/IODS-project).

During this course, we get acquainted with R and a set of data science methods, such as linear and logistic regression, clustering and classification algorithms, and dimensionality reduction. The course uses two textbooks: (1) Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences and (2) Harrison, Ewen & Pius, Riinu (2021). R for Health Data Science.

Week 1: Start me up!

During the first week, the focus is on getting everything up and running, and getting familiar with the syntax of R before moving to the more challenging topics of the course.

I use Python in my daily work, but so far I like how simple it is to read data into R.

# Here, we could load a csv-file to R as follows
# (read_csv() comes from the readr package):
# library(readr)
# data <- read_csv("some_data.csv")
# And view it as follows:
# View(data)
# But since we do not have "some_data" at our disposal,
# I have just added this as a comment.

# However, we can print the date:
date()
## [1] "Wed Nov 23 17:03:39 2022"

The Harrison & Pius book seems to work well for revising R syntax as well as for learning some new tricks for data transformations (mutate(), summarise(), pivot_wider(), factor(), etc.). The difficult part will be remembering when to utilise these functions…
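Just to test these for myself, here is a tiny sketch with made-up data (not course data), showing mutate(), summarise() and pivot_wider() in action:

# a toy data frame, just for trying out the verbs mentioned above
library(dplyr)
library(tidyr)

scores <- data.frame(
  student = c("a", "a", "b", "b"),
  term    = c("fall", "spring", "fall", "spring"),
  points  = c(22, 25, 18, 27)
)

scores %>%
  mutate(term = factor(term, levels = c("fall", "spring"))) %>%  # character -> factor
  group_by(student) %>%
  summarise(mean_points = mean(points))                          # one row per student

scores %>%
  pivot_wider(names_from = term, values_from = points)           # long -> wide format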

In the Vehkalahti & Everitt textbook chapters 1 and 2, I liked how the same data were presented in different visualisation formats, which made me reflect a lot about how I could have presented some of my data before. I will attempt to use this information in the future to actually focus a bit more on what could be an optimal visualisation method for some given data that I work with. Additionally, it was fun to get to draw different visualisations with R, and again it seems simpler compared to Python. In terms of how the material works as a “crash course” to R, I think that having a background in programming helps in getting started with the material, since it makes the programming part a bit easier. I believe the material could work if one comes with no background in programming, but the learning curve is a bit steeper in the beginning.

It is nice to get to refresh my memory of the statistical methods as well. I took an introductory course in statistics many years ago, so I have to recap/re-learn many of the topics. Thus, my favourite part of this week has actually been refreshing my memory on these statistical topics, and I expect to learn a lot, especially with respect to this field. I believe this information will be useful for my own research in the future.


Regression and model validation

Jump straight to the assignments.

date()
## [1] "Wed Nov 23 17:03:39 2022"

The core idea of regression analysis is to use statistical methods to assess the way in which a change in circumstances affects variation in an observed random variable. This section summarises my notes on the chapters covering simple linear regression and multiple linear regression.

There are four basic assumptions in linear regression:

  1. The relationship between predictors and outcome is linear
  2. The residuals are independent
  3. The residuals are normally distributed
  4. The residuals have equal variance

Simple linear regression

The simple linear regression model (the intercept \(\beta_0\) is not always needed) has only a single explanatory variable (\(x\)) and a single dependent variable (\(y\)):

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the linear relationship between the dependent variable and the explanatory variable, and \(\epsilon\) is an error term that measures the difference between the observed value and the prediction from the model.

The prediction of the model is presented without the error term as:

\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]

This equation can then be used for predicting the value of a dependent variable for some explanatory variable.
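As a small sketch of this (with simulated data, not the course data), we can fit a simple linear regression with lm() and then predict \(\hat{y}\) for new values of the explanatory variable:

# simulate a simple linear relationship: true intercept 2, true slope 0.5
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# fit the model and look at the estimated coefficients (beta_0 and beta_1)
fit <- lm(y ~ x)
coef(fit)

# predict the dependent variable for new values of the explanatory variable
predict(fit, newdata = data.frame(x = c(3, 7)))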

The variability of the dependent variable can be summarised with the regression mean square (RGMS) and the residual mean square (RMS):

\[ RGMS = \sum_{i=1}^{n} (\hat{y_i} - \overline{y})^2 \] \[ RMS = \sum_{i=1}^{n} \frac{(y_i - \hat{y_i})^2}{(n-2)} \]

Regression diagnostics

We can assess our model by calculating the difference between an observed value and our prediction:

\[ \hat{\epsilon} = y - \hat{y} \]

R prints another measure for assessing how close the data are to the fitted line, R-squared:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y - \hat{y})^2}{\sum_{i=1}^{n}(y - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample. R-squared represents the proportion of the variance in the dependent variable that is explained by the explanatory variable. If R-squared is 0.0, the explanatory variable explains none of the variation in the dependent variable. A value of 1.0 indicates that all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.
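Continuing the simulated sketch above, we can verify this formula against what summary() reports:

# R-squared "by hand" from the residuals and the total variation
y_hat <- fitted(fit)
1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# should match the value reported by summary()
summary(fit)$r.squared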

And we should also use visualisation to assess the model; useful plots include:

  • Boxplot (or a Q-Q plot) of the residuals
  • Residuals against the corresponding values of the explanatory variables
  • Residuals against the fitted values of the response variable

Based on these, we can decide whether a linear model is appropriate or whether we should use a non-linear model instead.
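For the simulated sketch above, these diagnostic plots could be drawn for example like this:

# residual diagnostics for the simulated fit above
par(mfrow = c(1, 3))
qqnorm(resid(fit)); qqline(resid(fit))                   # normality of the residuals
plot(x, resid(fit), main = "Residuals vs explanatory")   # residuals vs explanatory variable
plot(fitted(fit), resid(fit), main = "Residuals vs fitted")
par(mfrow = c(1, 1))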

Multiple linear regression

Here, we have more explanatory variables than in simple regression (e.g., analysing change in blood pressure based on coffee consumption in smokers and non-smokers). Basically, we have multiple parameters that affect the prediction.

\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Confounding is a concept referring to a situation where another explanatory variable distorts the relationship between an explanatory variable and the outcome (e.g., smoking could be related to coffee consumption and to blood pressure).
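A small simulated sketch of confounding (hypothetical numbers, not real data): smoking drives both coffee consumption and blood pressure, so a model that leaves smoking out makes coffee look harmful even though it has no true effect here.

# smoking is a confounder: it affects both coffee consumption and blood pressure
set.seed(2)
n <- 200
smoker <- rbinom(n, 1, 0.4)
coffee <- 2 + 1.5 * smoker + rnorm(n)             # smokers drink more coffee
bp     <- 120 + 8 * smoker + rnorm(n, sd = 5)     # coffee has no true effect on bp

coef(lm(bp ~ coffee))            # coffee appears to raise blood pressure (confounded)
coef(lm(bp ~ coffee + smoker))   # adjusting for smoking removes most of the apparent effect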

Assignments

1. Data wrangling

Done; the wrangled data is in my GitHub repo.

2. Regression analysis

Task 1:

First, we need to read the data we created in the data wrangling exercise. The data is stored in the data directory. Then we quickly check the structure and the dimensions of the data.

learning2014 <- read.table("data/learning2014.csv", sep=",", header=T)

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

dim() outputs the dimensionality of the data ([166, 7]), whereas str() outputs, in addition to the number of observations and variables (i.e., dimensions), a little more information about the data. str() also outputs a few examples of the observations in the seven variables.

The variables in the data are gender, age, attitude, deep, stra, surf, and points. I think that in general, this data is about students’ attitudes towards studying statistics and how they feel about learning the topics on some statistics course. The different variables indicate a student’s gender, the age of the student, student’s attitude towards statistics (averaged over 10 answers to questions on a Likert scale). From the link given in the material, I interpret that Deep, strategic (stra), and surface (surf) indicate learning methods based on some clusters of questions related to these methods. Points indicates the points a student got from the course exam. In the dataset considered in this exercise, we have the previous variables for 166 students.

Task 2:

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

I summarised the data into one scatterplot matrix. From the visualisation above, we can observe the different dimensions and how they interact. Scatterplots are drawn below the diagonal, the diagonal depicts each variable's distribution, and the upper triangle of the matrix displays the Pearson correlations.

First, we see that the sample has quite a lot more female students than male students. Most of the students in both groups were around 20-25 years old. Male students seem to have had a slightly more positive attitude towards statistics, with most of the mass in the middle of the Likert scale (around 3.5) but also more answers at the higher end of the scale compared to female students. Male students seem to have had a somewhat stronger tendency towards the deep learning strategy, whereas female students seem to have been more prone to the strategic dimension. Female students also seem to have been more prone to the surface dimension. Points in the exam seem quite similarly distributed between the groups. However, it seems that more male students acquired the highest points.

Most of the variables seem not to be correlated based on the Pearson correlation coefficient. However, a few of them are. Quite understandably, attitude seems to correlate with the points acquired.

Task 3:

Now, we look at multivariable regression. Let the first explanatory variable be attitude, the second stra, and the third surf. These variables were chosen based on their correlation with the exam points, as indicated by the Pearson correlation coefficients in the visualisation above.

We can first visualise each variable's relationship with the exam scores independently:

library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

library(ggplot2)
qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

library(ggplot2)
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Based on the three visualisations, it seems that both attitude and stra have a positive correlation with the exam results. Surf, unsurprisingly, seems to have a slightly negative correlation. Next, we fit a multivariable regression model to the given explanatory variables.

# create a regression model with multiple explanatory variables
multivariable_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# summary
multivariable_model %>%
  summary()
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary includes a five-number summary of the residuals (min, first quartile, median, third quartile, and max). If the model fits the data well, the distribution of the residuals should be symmetrical around zero.

Next, there is the coefficients block with values for estimate, Std. error, t-value, and p-value.

The estimate column indicates that a change of one point in attitude is associated with a change of approximately 3.4 points in the exam score. For stra and surf, the corresponding changes are approximately 0.9 and -0.6 points, respectively.

The Std. Error indicates how much the coefficient estimates are expected to vary from the true (unknown) coefficient values.

The t-value indicates the distance of a coefficient estimate from 0, measured in standard errors. The value can be positive or negative, but it should be far from 0. A large distance from 0 would indicate a relationship between the given variables (e.g., attitude and points), whereas a value at (or near) 0 means that there is no relationship.

Finally, we have the probability of getting any value that is equal or larger than t. This value should be as small as possible for us to be able to confidently reject the null hypothesis.

The significance codes are included in the output for a quicker read of the results.

Lastly, there are the residual standard error (RSE), multiple and adjusted R-squared, and the F-statistic. The RSE measures how well the model fits the data based on the residuals (\(y-\hat{y}\)). In practice, it is the square root of the residual sum of squares divided by the residual degrees of freedom, \(n - p - 1\), where \(p\) is the number of explanatory variables:

\[ RSE = \sqrt{\frac{1}{n - p - 1}\sum_{i=1}^{n}(y_i - \hat{y_i})^2} \] R-squared (\(R^2\)) is another value to assess how well the model fits the data. In practice, R-squared represents the proportion of the variance in the dependent variable that is explained by the explanatory variables. If R-squared is 0.0, the explanatory variables explain none of the variation in the dependent variable. If the result is 1.0, all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.

R-squared is calculated as:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y - \hat{y})^2}{\sum_{i=1}^{n}(y - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample.

F-test compares the means of the groups and indicates whether at least one variable is statistically significant.
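To convince myself of these definitions, here is a quick sanity check against the model summary above, using the fitted multivariable_model object:

# residual standard error "by hand": sqrt(RSS / residual degrees of freedom)
res <- resid(multivariable_model)
n <- nrow(learning2014)
p <- 3                                         # number of explanatory variables

sqrt(sum(res^2) / (n - p - 1))                 # should match the RSE (5.296 on 162 df)
summary(multivariable_model)$sigma

# R-squared "by hand"
1 - sum(res^2) / sum((learning2014$points - mean(learning2014$points))^2)
summary(multivariable_model)$r.squared         # should match 0.2074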

The summary of the model indicates that only attitude has a significant correlation with the exam points, whereas the correlations between stra and points and between surf and points are not statistically significant. R-squared seems low to me. Perhaps this is because we have so few data points to fit the model to?

Next, we remove the non-significant variables and fit the model again.

# create a regression model with multiple explanatory variables
attitude_model <- lm(points ~ attitude, data = learning2014)

# summary
attitude_model %>%
  summary()
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now, we only have the significant variable in the model.

Task 4:

Most of this task was done in the earlier task, but in short, a change of one point (on the Likert scale) in attitude results in a change of 3.5 points in the exam scores. R-squared was also explained above.

Task 5:

Finally, draw some visualisations of the residuals. First, I print the regression line of the attitude variable again, just to help in analysis:

library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Next, I start outputting the different residual plots, beginning with Residuals vs Fitted.

# plot all plots:
# plot(attitude_model, which = c(1, 2, 5))
# but for the sake of the notebook, I plot these one by one
plot(attitude_model, which=1)

In the plot above, the residuals (y-axis) are plotted against the fitted values (x-axis). The residual line does not correspond exactly to the regression line. I think (but please correct me if I am wrong) what we can read from the line is that the variance of the residuals is larger at the upper end of the fitted values, meaning that the model might not fit equally well across the whole range. This is likely an issue of small data size in this case.

plot(attitude_model, which=2)

The Q-Q plot shows that the residuals are not perfectly normally distributed but deviate at both tails. What I read from this is that there might be values at both ends that are far from what we would expect if the residuals were normally distributed.

plot(attitude_model, which=5)

Residuals vs Leverage shows how important different data points are to the regression model. The spread of the data points should not change as a function of leverage, but in the right end of the line it seems to change. Perhaps this is again an issue with small data size? There are no data points outside of Cook’s distance, so there are no data points whose deletion from the data would have a high influence on the regression model.


Logistic regression

Jump straight to the assignments.

date()
## [1] "Wed Nov 23 17:03:46 2022"

This part of the course focuses on logistic regression, a type of model where the output is categorical. For example, if we want to predict whether an email is spam or not, or whether a tumour is malignant or benign, we could use a logistic regression model. Instead of only two, we could also have multiple possible labels.

Some notes about Logistic Regression

Linear regression models the population mean of the response variable directly as a linear function of the explanatory variables:

\[ E(y|x_1, x_2, ..., x_n) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]

For classification of categorical outcomes this is not suitable; we need to output something that we can interpret as probabilities for the different outcomes. The issue with linear regression models is that the output can be any real number between \(-\infty\) and \(\infty\). Thus, we need a link function, which here is the logarithm of the odds: \(log(\frac{\pi}{1-\pi})\). Logistic regression then takes the form of:

\[ logit (\pi) = log(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]

The logistic regression function rearranged is:

\[ \pi = \frac{ \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}{1 + \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)} \] The observed variable \(y\) is represented as:

\[ y = \pi(x_1, x_2, ..., x_n) + \epsilon \] where \(\epsilon = 1 - \pi(x_1, x_2, ..., x_n)\) with probability \(\pi(x_1, x_2, ..., x_n)\), in which case \(y = 1\); otherwise \(\epsilon = -\pi(x_1, x_2, ..., x_n)\) with probability \(1 - \pi(x_1, x_2, ..., x_n)\), in which case \(y = 0\).
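As a small sketch in R (made-up numbers), the rearranged function above is the inverse of the logit link, which R provides as plogis() (and qlogis() for the logit itself):

# a hypothetical linear predictor beta_0 + beta_1 * x
eta <- -1 + 0.8 * 2

# inverse logit "by hand": a probability between 0 and 1
exp(eta) / (1 + exp(eta))

# built-in inverse logit and logit; qlogis() recovers the linear predictor
plogis(eta)
qlogis(plogis(eta))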

Assignments

Data wrangling

The data for the following analysis exercise is created here.

Logistic regression

In this assignment, we will apply logistic regression to the alcohol consumption data created in the previous exercise. The original data is presented in Cortez & Silva, (2008).1

library(readr)
alc <- read_csv("data/alc.csv", show_col_types = FALSE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The wrangled data consists of 370 observations and 33 original variables. In practice, the observations are Portuguese secondary school students attending courses in mathematics and in Portuguese. The variables include student demographics, social and school-related features, and grades. We have added two further variables, alc_use and high_use, where alc_use averages the weekday and weekend alcohol consumption, while high_use is a logical value indicating whether a student's alcohol consumption is low (FALSE) or high (TRUE).

The goal of this exercise is to explain the relationship between different explanatory variables and alcohol consumption. As there are so many options, we limit the number of variables explored to four. There are many variables that I believe could explain alcohol consumption, but for this experiment, I will choose variables related to free-time activity. My hypothesis is that the more 'other' activities a student has (internet, romantic, activities), the less free time they have and the less alcohol they consume.

First, we might need to change the string-formatted answers to integers.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ purrr   0.3.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(finalfit)

alc <- alc %>%
  mutate(internet.int = factor(internet) %>%         
           fct_recode("0" = "no",
                      "1" = "yes"),

         romantic.int = factor(romantic) %>% 
           fct_recode("0" = "no",
                      "1"  = "yes"),
         
         activities.int = factor(activities) %>% 
           fct_recode("0" = "no",
                      "1"  = "yes"),
  )

Next, let’s visualise the variables with respect to alcohol consumption.

# access the tidyverse packages dplyr and ggplot2
library(dplyr); library(ggplot2)

g1 <- ggplot(data = alc, aes(x = high_use, fill = romantic.int))
g1 + geom_bar() + facet_wrap("romantic")

g2 <- ggplot(data = alc, aes(x = high_use, fill = internet.int))
g2 + geom_bar() + facet_wrap("internet")

g3 <- ggplot(data = alc, aes(x = high_use, fill = activities.int))
g3 + geom_bar() + facet_wrap("activities")

g4 <- ggplot(data = alc, aes(x = high_use, fill = freetime))
g4 + geom_bar() + facet_wrap("freetime")

Based on the barplots, it seems that students who are in a romantic relationship are less prone to high usage of alcohol. Internet, however, does not show a similar effect. Rather, it seems that students who have internet connection at home seem to consume more alcohol. This is probably explained by the fact that not many households are without internet connection.

library(dplyr)
alc %>% 
  group_by(internet.int) %>%
  summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
##   internet.int percent
##   <fct>          <dbl>
## 1 0               15.4
## 2 1               84.6

Indeed, approximately 85% of the students' households have an internet connection.

Extracurricular activities seem to not have a large effect on alcohol consumption, but perhaps students who have no extracurricular activities might be slightly more prone to high alcohol consumption.

Lastly, it seems that the more free time students have, the smaller the gap between high and low alcohol consumption.

Next, I use logistic regression to analyse the variables further, and see whether my hypothesis really holds.

m <- glm(high_use ~ romantic.int + internet.int + activities.int + freetime, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ romantic.int + internet.int + activities.int + 
##     freetime, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2100  -0.8970  -0.7151   1.3130   1.9156  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.0503     0.5010  -4.093 4.26e-05 ***
## romantic.int1    -0.1777     0.2510  -0.708  0.47882    
## internet.int1     0.1793     0.3320   0.540  0.58925    
## activities.int1  -0.3529     0.2329  -1.515  0.12983    
## freetime          0.3895     0.1225   3.179  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 439.01  on 365  degrees of freedom
## AIC: 449.01
## 
## Number of Fisher Scoring iterations: 4

Based on the logistic regression, it seems that only freetime has a statistically significant association with high alcohol consumption.

library(broom)
m %>% 
  tidy(conf.int = TRUE, exp = TRUE)
## # A tibble: 5 × 7
##   term            estimate std.error statistic   p.value conf.low conf.high
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)        0.129     0.501    -4.09  0.0000426   0.0466     0.334
## 2 romantic.int1      0.837     0.251    -0.708 0.479       0.508      1.36 
## 3 internet.int1      1.20      0.332     0.540 0.589       0.635      2.35 
## 4 activities.int1    0.703     0.233    -1.51  0.130       0.444      1.11 
## 5 freetime           1.48      0.123     3.18  0.00148     1.17       1.89

Above, we have the coefficients (estimate) and the confidence intervals (conf.low and conf.high) for the different variables. As the only statistically significant explanatory variable is the amount of free time, I mostly focus on that. The coefficient, approx. 1.48, indicates that moving one step up in freetime, on its scale from 1 (very low) to 5 (very high), multiplies the odds of high alcohol consumption by 1.48. In other words, an increase of one point in freetime increases the odds of high alcohol consumption by 48%. The coefficients below 1 in the tidy output (i.e., negative on the log-odds scale) indicate that these explanatory variables (romantic relationship and extracurricular activities) have a decreasing effect on the risk of high alcohol consumption. That is, students who are in a romantic relationship or have extracurricular activities are less prone to high alcohol consumption.
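As a cross-check (a small sketch using the same model object m), the same odds ratios and intervals can also be obtained by exponentiating the model coefficients directly:

# exponentiated coefficients are the odds ratios reported by tidy(exp = TRUE)
exp(coef(m))

# profile confidence intervals on the odds-ratio scale
# (should correspond to conf.low and conf.high above)
exp(confint(m))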

m <- glm(high_use ~ freetime, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   259
##    TRUE    111

It seems that the model fitted only on freetime always predicts FALSE. We can calculate our model’s accuracy from the confusion matrix by:

\[ acc = \frac{(TP + TN)}{(P + N)} \]

For our model, that would result in \(acc = \frac{259}{370} = 0.7\)
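As a quick sketch, the same accuracy can be computed directly in R as the proportion of observations where the prediction matches the true label:

# accuracy = (TP + TN) / (P + N), i.e. the share of correct predictions
mean(alc$high_use == alc$prediction)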

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3

Our loss function returns a loss of 0.3, indicating that we predict the wrong label for 30% of the observations.

library(dplyr)
alc %>% 
  group_by(high_use) %>%
  summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
##   high_use percent
##   <lgl>      <dbl>
## 1 FALSE         70
## 2 TRUE          30

The data consists of 70% FALSE labels. Thus, if we always guessed the majority class, as our model actually does, our baseline accuracy would be 0.7, and any useful model should be able to obtain an accuracy higher than that.

Bonus exercise 1

m2 <- glm(high_use ~ sex + failures + absences + famrel + goout + health, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m2, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2081081

I can get loss of approximately 0.2 in cross-validation by adding sex, failures, absences, famrel, goout, and health to the model features.


Clustering and classification

date()
## [1] "Wed Nov 23 17:03:49 2022"

This chapter focuses on a set of clustering and classification methods designed for visualizing and exploring statistical data. In general, we want to train a model that can position data points into different clusters (or groups) based on their characteristics. After the model is trained, it can be used to assign unseen data to the learnt clusters. One of the most popular of these methods is k-means clustering.
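Before the assignments, here is a minimal k-means sketch on made-up 2-D data (purely hypothetical), including assigning a new, unseen point to the nearest learnt cluster centre:

# two made-up clusters of points in 2-D
set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

# k-means with 2 clusters; the learnt centres represent the clusters
km_toy <- kmeans(pts, centers = 2)
km_toy$centers

# "classify" a new observation by its squared distance to each centre
new_point <- c(3.5, 4.2)
which.min(colSums((t(km_toy$centers) - new_point)^2))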

Assignments

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset contains explanatory variables related to housing values in the suburbs of Boston. The data consists of 506 observations and 14 variables. In practice, the variables are ones that can be assumed to affect housing values, e.g., crime rate, air quality, average number of rooms, median value of owner-occupied homes, etc. More information about the data can be found here.

# plot matrix of the variables
pairs(Boston)

The plot matrix gives a graphical overview of the data. However, this visualization is rather difficult to read. A correlation matrix provides a slightly clearer picture.

library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle")

From the correlation matrix, we can interpret the relationships between different variables. For instance, it seems that proportion of non-retail business acres per town, indicated by indus, has a rather high negative correlation with weighted mean of distances to five Boston employment centres, indicated by dis. Unsurprisingly, indus also seems to have a high (positive) correlation with nitrogen oxides concentration (parts per 10 million), indicated by nox. Median value of owned homes (medv) seems to correlate positively with number of rooms (rm), and negatively with lower status of the population (lstat).

What I think we can read from this data is, for instance, that in areas where the median value of owned homes (medv) is high, we see less crime, fewer non-retail businesses, better air quality, more rooms per house, and fewer pupils per teacher, based on the correlations between the variables.

Summaries of the variables:

# summaries of the variables of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Next, we standardize the values with:

\[ scaled(x) = \frac{x-mean(x)}{std(x)} \]

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Next, we use the scaled values of the crime rate to create bins where we store crime rates as categorical variables (a scale of four from low crime rate to high crime rate). Furthermore, we drop the old crime rates from the data and replace them with the crime rate quantiles.

boston_scaled$crim <- as.numeric(boston_scaled$crim)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim,
             breaks = bins,
             include.lowest = TRUE,
             label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Before we can train a classification model (here, linear discriminant analysis) or apply any other ML technique, we should split the data into train and test sets. Here, we create a split of 80% training data and 20% test data. For tuning the model, we might also want a development set, so our split could be 80-10-10, but for now we stick with 80-20. To the best of my understanding, from now on, each row in the data (consisting of observations for 14 variables) becomes a 13-dimensional vector that represents one data point for our model. The output, predicted from the 13-dimensional vector, should then be the crime rate class (4 classes: low, med_low, med_high, high).

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

nrow(train); nrow(test);
## [1] 404
## [1] 102

Next, we fit the LDA model to the training data.

# LDA
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(crime)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 2)

In the 2-dimensional visualization of the linear discriminant analysis, we see that the crime clusters are rather clear: high crime rates are in the right part of the space, med_high is positioned towards the bottom of the space, and low crime rate is at the top left. Med_high and med_low seem to form a larger cloud that is not very clearly separated, likely because their values were also close to each other when we generated the bins of the crime rates. Additionally, from the visualization, we can observe how the different variables contribute to the separation. We see that the model associates the index of accessibility to radial highways (rad) quite strongly with high crime rate, whereas the proportion of residential land zoned for lots over 25,000 sq.ft. (zn) is quite strongly associated with low crime rate.

Let us then see how the classification model generalizes to unseen data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
predictions <- lda.pred$class
table(correct = correct_classes, predicted = predictions)
##           predicted
## correct    low med_low med_high high
##   low       15      12        1    0
##   med_low    3      14        7    0
##   med_high   0       6       13    1
##   high       0       0        1   29

The model does not seem very robust, and it predicts wrong labels especially when the correct label is med_low. However, those are the more difficult examples to predict correctly.

Finally, let us calculate the distances between the observations, and run k-means clustering.

library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# euclidean distance
dist <- dist(boston_scaled, method = "euclidean")

# look at the summary of the distances
summary(dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

K-means

set.seed(13)

# k-means clustering with 4 clusters (on the scaled data, so no single variable dominates the distances)
km <- kmeans(boston_scaled, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

The clusters are not very visible from such a small scatter plot matrix.

pairs(boston_scaled[6:10], col = km$cluster)

Here we notice that some of the clusters lie on top of each other, and the number of clusters is potentially too high. Let's attempt to find the optimal number of clusters.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Above, the within cluster sum of squares (WCSS) provides us a way to find the optimal number of clusters. Practically, the optimal number is that where the value changes radically. From the visualization, we can approximate that the optimal number of clusters is 2, because at that point the value changes quite a lot.

# New k-means with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# plot the scaled Boston dataset with 2 clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

Now the clusters are perhaps more visible, and, hopefully, in their own positions in the space.


(more chapters to be added similarly as we proceed with the course!)


  1. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.↩︎